Capital Bikeshare (also called CapBi) is a bicycle-sharing system that serves Washington, DC, Arlington County, Alexandria, Falls Church, Montgomery County, Prince George’s County, and Fairfax County. The system is owned by the local governments and operated by Motivate International, Inc. (Motivate International, Inc). As of August 2019, Capital Bikeshare has 500 stations and 4,300 bicycles.
The distribution of the docks is shown below:
As the image above shows, the majority of the bicycle docks are in Washington, DC.
Bike tours in Washington, DC are a popular family activity, and renting a bike is a great way to get around without breaking the bank or sitting in traffic. Washington, DC has dedicated bike lanes, which make riding safer and more convenient.
Capital Bikeshare is cheaper than its competitors, and its docks are conveniently placed around the monuments. It is often faster than other modes of transportation, and its annual membership includes unlimited trips under 30 minutes, which helps riders save money. CapBi can be used to commute to work or to ride to meet friends, and it doubles as exercise because the bikes are human-powered rather than electric. CapBi also saves fuel and avoids carbon emissions, so it is healthy for both the rider and the environment.
Because CapBi is popular and in constant demand, we want to predict the number of bikes riders will use per hour so that contingencies can be put in place to meet that demand. To estimate the number of bikes required, we consider factors such as weather, temperature, working or non-working hour, and the hour of the day.
Fun Fact: CapBi offers GWU students an annual membership for only $25 (“Capital Bikeshare Discount”).
The data is sourced from the official Capital Bikeshare website, https://www.capitalbikeshare.com/system-data. We have downloaded data for September 2013 to September 2019.
The official data contains only the following variables:
| Variable | Description |
|---|---|
| Duration | Duration of trip |
| Start Date | Includes start date and time |
| End Date | Includes end date and time |
| Start Station | Includes starting station name and number |
| End Station | Includes ending station name and number |
| Bike Number | Includes ID number of bike used for the trip |
| Member Type | Indicates whether user was a “registered” member (Annual Member, 30-Day Member or Day Key Member) or a “casual” rider (Single Trip, 24-Hour Pass, 3-Day Pass or 5-Day Pass) |
We dropped irrelevant columns such as Duration, End Date, End Station, Bike Number and Member Type from the official Capital Bikeshare dataset.
To predict the number of bikes used per hour, we scraped weather data from the following website: https://www.wunderground.com/history/daily/us/va/arlington-county/KDCA/.
To understand whether holidays increase or decrease bike usage, we also downloaded the holiday dataset from https://www.kaggle.com/gsnehaa21/federal-holidays-usa-19662020.
We merged all the data sources into a single file, whose structure is as follows:
## 'data.frame': 6627646 obs. of 16 variables:
## $ Start.date : Factor w/ 52101 levels "2013-10-01 00:00:00",..: 2 9 10 11 12 13 14 15 16 17 ...
## $ Start.station : Factor w/ 658 levels "10th & E St NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Condition : Factor w/ 47 levels "","Cloudy","Cloudy / Windy",..: 29 2 27 27 29 29 29 29 29 29 ...
## $ Wind : Factor w/ 19 levels "","CALM","E",..: 15 19 2 3 12 6 6 10 19 17 ...
## $ Temperature..F. : num 63 64 66 67 71 79 81 83 84 84 ...
## $ Dew.Point..F. : num 55 56 56 57 58 58 58 58 55 55 ...
## $ Humidity.... : num 75 75 70 70 63 48 45 42 37 37 ...
## $ Wind.Speed..mph.: num 3 3 0 3 3 7 8 6 6 9 ...
## $ Wind.Gust..mph. : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Pressure..in. : num 30 30.1 30.1 30.1 30.1 ...
## $ Precip...in. : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Holiday : Factor w/ 10 levels "Birthday of Martin Luther King, Jr.",..: NA NA NA NA NA NA NA NA NA NA ...
## $ weekday : Factor w/ 2 levels "Weekday","Weekend": 1 1 1 1 1 1 1 1 1 1 ...
## $ timeOfDay : Factor w/ 2 levels "Non Working Hour",..: 1 1 1 2 2 2 2 2 2 2 ...
## $ season : Factor w/ 4 levels "Fall","Spring",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ noOfBikes : int 1 2 1 3 2 4 3 3 2 5 ...
Once the data are merged, we preprocess them. We dropped columns such as Start.station, Wind, Wind.Gust..mph., Pressure..in. and Precip...in., as they are not useful for our analysis.
The variable Condition has 47 unique levels:
## [1] Cloudy
## [3] Cloudy / Windy Fair
## [5] Fair / Windy Fog
## [7] Haze Heavy Rain
## [9] Heavy Rain / Windy Heavy Snow
## [11] Heavy T-Storm Heavy T-Storm / Windy
## [13] Light Drizzle Light Drizzle / Windy
## [15] Light Freezing Drizzle Light Freezing Rain
## [17] Light Rain Light Rain / Windy
## [19] Light Rain with Thunder Light Sleet
## [21] Light Sleet / Windy Light Snow
## [23] Light Snow / Windy Light Snow and Sleet
## [25] Light Snow and Sleet / Windy Mist
## [27] Mostly Cloudy Mostly Cloudy / Windy
## [29] Partly Cloudy Partly Cloudy / Windy
## [31] Patches of Fog Rain
## [33] Rain / Windy Rain and Sleet
## [35] Rain and Snow Shallow Fog
## [37] Sleet Snow
## [39] Snow and Sleet Squalls / Windy
## [41] T-Storm T-Storm / Windy
## [43] Thunder Thunder / Windy
## [45] Thunder in the Vicinity Wintry Mix
## [47] Wintry Mix / Windy
## 47 Levels: Cloudy Cloudy / Windy Fair Fair / Windy Fog ... Wintry Mix / Windy
We condense the Condition column from 47 levels into 6 levels.
If the condition is Cloudy, Cloudy / Windy, Mostly Cloudy, Mostly Cloudy / Windy, Partly Cloudy or Partly Cloudy / Windy, we replace it with Cloudy alone. Similar logic is used for the other weather conditions.
We finally have the following levels in the Condition column:
## [1] "Cloudy" "Fair" "Fog" "Rain" "Snow" "Windy"
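One way to express this kind of collapsing in R is a pattern-based recoding. The sketch below is illustrative only: the helper name `collapse_condition` and the grouping rules (for example, which raw levels end up as `Windy` or `Snow`) are assumptions, since the report lists the full mapping only for the Cloudy group.

```r
# Hypothetical helper: map raw weather descriptions to six coarse groups.
# The grouping rules are assumed for illustration, not taken from the report.
collapse_condition <- function(x) {
  x <- as.character(x)
  out <- rep(NA_character_, length(x))
  # Order matters: a rule only fills values that are still unassigned.
  rules <- list(
    Snow   = "Snow|Sleet|Wintry|Freezing",
    Rain   = "Rain|Drizzle|T-Storm|Thunder",
    Fog    = "Fog|Mist|Haze",
    Cloudy = "Cloudy",
    Fair   = "Fair",
    Windy  = "Windy|Squalls"
  )
  for (label in names(rules)) {
    hit <- is.na(out) & grepl(rules[[label]], x)
    out[hit] <- label
  }
  factor(out)
}

raw <- c("Mostly Cloudy / Windy", "Partly Cloudy", "Fair / Windy",
         "Light Rain", "Heavy Snow", "Patches of Fog", "Squalls / Windy")
collapsed <- collapse_condition(raw)
print(levels(collapsed))
```

Values that match no rule (such as the empty level) stay `NA` in this sketch, so they can be inspected separately.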
The Holiday column has the following levels:
## [1] <NA>
## [2] Columbus Day
## [3] Veterans Day
## [4] Thanksgiving Day
## [5] Christmas Day
## [6] New Year's Day
## [7] Birthday of Martin Luther King, Jr.
## [8] Washington's Birthday
## [9] Memorial Day
## [10] Independence Day
## [11] Labor Day
## 10 Levels: Birthday of Martin Luther King, Jr. ... Washington's Birthday
We convert the Holiday column from factors into a binary column where 0 means no Holiday and 1 means Holiday.
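A minimal sketch of this conversion, assuming (as the level listing above suggests) that ordinary days carry `NA` in the Holiday column. The vector names are hypothetical.

```r
# Holiday names are NA on ordinary days, so "is not NA" marks a holiday.
holiday_raw  <- factor(c(NA, "Labor Day", NA, "Christmas Day"))
holiday_flag <- factor(ifelse(is.na(holiday_raw), 0, 1), levels = c(0, 1))
print(holiday_flag)
```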
Our dataset records the number of bikes used per hour at each station. To simplify this, we aggregate all the CapBi data across stations on an hourly basis.
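The aggregation can be sketched with base R's `aggregate()`. The toy data frame, the station labels and the object names `trips` and `hourly` are made up for illustration; only the column names follow the structure shown earlier. The weather columns, which are recorded per hour, would need to be carried along separately.

```r
# Toy station-level data: one row per (hour, station) with a bike count.
trips <- data.frame(
  Start.date    = c("2013-10-01 00:00:00", "2013-10-01 00:00:00",
                    "2013-10-01 01:00:00"),
  Start.station = c("Station A", "Station B", "Station A"),
  noOfBikes     = c(1L, 2L, 4L)
)
# Sum the counts over all stations within each hour.
hourly <- aggregate(noOfBikes ~ Start.date, data = trips, FUN = sum)
names(hourly)[names(hourly) == "noOfBikes"] <- "Total_Bikes"
print(hourly)
```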
We also rename the columns for ease of use. We use the lubridate package to extract the hour, month, day and year from the Start_Date column, which is of type character. For example, for the timestamp 2019-09-20 18:00:00 the hour is 18, the month is 09, the day is 20 and the year is 2019.
We convert the following columns to factors: HourOfDay, Month, Year, Day, Condition, Holiday, Weekday, TimeofDay and Season. The final processed dataframe looks as follows:
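The same extraction can be sketched in base R, which keeps the example free of package dependencies; lubridate's `hour()`, `month()`, `day()` and `year()` return the corresponding values. The UTC time zone below is an assumption made only to keep the example deterministic.

```r
# Base-R counterpart of lubridate's hour(), month(), day() and year().
stamp <- as.POSIXlt("2019-09-20 18:00:00", tz = "UTC")
parts <- c(
  hour  = stamp$hour,        # 0-23
  month = stamp$mon + 1,     # POSIXlt months are 0-based
  day   = stamp$mday,
  year  = stamp$year + 1900  # POSIXlt years count from 1900
)
print(parts)
```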
## Classes 'tbl_df', 'tbl' and 'data.frame': 52097 obs. of 15 variables:
## $ Start_Date : Factor w/ 52101 levels "2013-10-01 00:00:00",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Condition : Factor w/ 6 levels "Cloudy","Fair",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ Temp : num 62 63 63 61 60 61 61 63 64 66 ...
## $ Dew : num 52 55 55 55 55 55 56 56 56 56 ...
## $ Humidity : num 70 75 75 81 83 81 83 78 75 70 ...
## $ Windspeed : num 3 3 5 6 5 5 5 3 3 0 ...
## $ Holiday : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Weekday : Factor w/ 2 levels "Weekday","Weekend": 1 1 1 1 1 1 1 1 1 1 ...
## $ TimeofDay : Factor w/ 2 levels "Non Working Hour",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Season : Factor w/ 4 levels "Fall","Spring",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ HourOfDay : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Month : Factor w/ 12 levels "1","2","3","4",..: 10 10 10 10 10 10 10 10 10 10 ...
## $ Day : Factor w/ 31 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Year : Factor w/ 7 levels "2013","2014",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Total_Bikes: int 38 41 23 5 4 21 108 396 1044 667 ...
The first 6 rows of the final processed dataset are:
| Start_Date | Condition | Temp | Dew | Humidity | Windspeed | Holiday | Weekday | TimeofDay | Season | HourOfDay | Month | Day | Year | Total_Bikes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2013-10-01 00:00:00 | Cloudy | 62 | 52 | 70 | 3 | 0 | Weekday | Non Working Hour | Fall | 0 | 10 | 1 | 2013 | 38 |
| 2013-10-01 01:00:00 | Cloudy | 63 | 55 | 75 | 3 | 0 | Weekday | Non Working Hour | Fall | 1 | 10 | 1 | 2013 | 41 |
| 2013-10-01 02:00:00 | Cloudy | 63 | 55 | 75 | 5 | 0 | Weekday | Non Working Hour | Fall | 2 | 10 | 1 | 2013 | 23 |
| 2013-10-01 03:00:00 | Fair | 61 | 55 | 81 | 6 | 0 | Weekday | Non Working Hour | Fall | 3 | 10 | 1 | 2013 | 5 |
| 2013-10-01 04:00:00 | Fair | 60 | 55 | 83 | 5 | 0 | Weekday | Non Working Hour | Fall | 4 | 10 | 1 | 2013 | 4 |
| 2013-10-01 05:00:00 | Cloudy | 61 | 55 | 81 | 5 | 0 | Weekday | Non Working Hour | Fall | 5 | 10 | 1 | 2013 | 21 |
In this analysis, the question that we want to answer or predict is -
Since we changed our topic and dataset for this project, we did a quick EDA to understand the underlying patterns before moving to the models. This helps us understand what to expect from the models.
From the plots below we can see that winter is the least popular season for hiring bikes, while spring, summer, and fall show similar patterns. This makes sense: snow-covered roads make cycling difficult, so demand for bikes goes down in winter.
This plot also includes temperature. More bikes are rented as temperature increases, and rentals are highest between 80 and 90 degrees Fahrenheit.
The plot shows that people bike most in cloudy weather, followed by fair weather. Rain, snow and windy conditions are not preferred.
The number of bikes rented increased steadily up to 2017 and then decreased in 2018. There is a slight difference between weekdays and weekends, with more bikes rented on weekdays because that is when people commute to work.
Bike hires peak during the morning and evening rush hours, around 8 AM and 6 PM, when people are heading to or returning from work.
Riders rent bikes more often on days that are not holidays, but the number of bikes rented on holidays is still significant. This could be because people like to get around DC by bike and go sightseeing on holidays.
Temperature has a positive correlation of 0.44 with bikes hired, while Humidity has a negative correlation of -0.30.
The correlation table is as follows:

| | Total_Bikes | Temp | Dew | Humidity | Windspeed | HourOfDay |
|---|---|---|---|---|---|---|
| Total_Bikes | 1.00 | 0.44 | 0.24 | -0.30 | 0.10 | 0.42 |
| Temp | 0.44 | 1.00 | 0.89 | 0.08 | -0.07 | 0.14 |
| Dew | 0.24 | 0.89 | 1.00 | 0.50 | -0.18 | -0.01 |
| Humidity | -0.30 | 0.08 | 0.50 | 1.00 | -0.29 | -0.29 |
| Windspeed | 0.10 | -0.07 | -0.18 | -0.29 | 1.00 | 0.15 |
| HourOfDay | 0.42 | 0.14 | -0.01 | -0.29 | 0.15 | 1.00 |
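A table of this shape can be produced with base R's `cor()` on the numeric columns. The sketch below uses simulated data, so the numbers are not the report's; the object names `toy` and `corr` are hypothetical. For HourOfDay to appear in such a table it has to be numeric at that point, which is presumably the case before the conversion to a factor.

```r
set.seed(1)
# Toy data standing in for the numeric columns of the hourly dataset.
toy <- data.frame(Temp = rnorm(200, mean = 60, sd = 15))
toy$Dew <- toy$Temp - abs(rnorm(200, mean = 10, sd = 5))
toy$Total_Bikes <- 5 * toy$Temp + rnorm(200, sd = 100)
# Pairwise Pearson correlations, rounded to two decimals as in the table.
corr <- round(cor(toy), 2)
print(corr)
```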
The correlation plot is as follows:
For our analysis we scale the numeric variables (Temp, Dew, Humidity and Windspeed) so that predictors measured in different units are on a comparable scale.
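As an illustration, standardizing a column with base R's `scale()` looks like the sketch below. It assumes z-score scaling (mean 0, standard deviation 1); the report does not state which scaling was used or whether the statistics were computed on the full data or only on the training years.

```r
temp_f <- c(62, 63, 63, 61, 60, 61)  # first six Temp values from the processed data
# scale() centers to mean 0 and divides by the sample standard deviation.
temp_scaled <- as.numeric(scale(temp_f))
print(round(temp_scaled, 3))
```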
We now split our dataset into training and test sets. Samples from 2013 through 2018 are used to train the model, and samples from 2019 are used to validate its performance.
The training set has 45557 samples and the test set has 6540 samples.
We also drop the Start_Date, Month, Day and Year columns from the training and test sets, as these are not useful when creating models.
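The year-based split and column drop can be sketched as follows. The data frame is a three-row toy example with made-up values, and `Test_Set` is a hypothetical name; `Training_Set` is the name that appears in the model calls later in the report.

```r
# Toy frame with the columns that matter for the split (values are made up).
bikes <- data.frame(
  Start_Date  = c("2017-05-01 08:00:00", "2018-05-01 08:00:00",
                  "2019-05-01 08:00:00"),
  Month = c(5, 5, 5), Day = c(1, 1, 1),
  Year  = factor(c(2017, 2018, 2019)),
  Temp  = c(0.1, -0.3, 0.5),
  Total_Bikes = c(900L, 950L, 870L)
)
drop_cols <- c("Start_Date", "Month", "Day", "Year")
keep_cols <- setdiff(names(bikes), drop_cols)
# 2019 is held out for testing; earlier years are used for training.
is_test <- as.character(bikes$Year) == "2019"
Training_Set <- bikes[!is_test, keep_cols]
Test_Set     <- bikes[is_test, keep_cols]
print(dim(Training_Set))
print(dim(Test_Set))
```

Splitting by year rather than at random keeps the test set strictly later in time than the training data.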
Since the number of bikes is a numerical value, predicting it is a regression problem. We first create linear regression models and try to optimize them. We then use a decision tree for regression and also try bagged decision trees. Finally, we create a random forest model and tune it to get the best results.
We fit a linear regression on the training set, using all the variables to predict the number of bikes per hour, which is our ‘y’ variable.
The summary of the linear model is as follows:
##
## Call:
## lm(formula = y ~ ., data = Training_Set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -804.17 -117.77 -13.27 103.45 1114.28
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 163.028 5.049 32.290 < 2e-16 ***
## ConditionFair 10.105 2.499 4.044 5.27e-05 ***
## ConditionFog 5.225 11.510 0.454 0.649893
## ConditionRain -132.889 3.903 -34.046 < 2e-16 ***
## ConditionSnow -38.918 11.450 -3.399 0.000677 ***
## ConditionWindy -1.231 14.789 -0.083 0.933687
## Temp 157.142 6.888 22.814 < 2e-16 ***
## Dew -17.833 7.864 -2.268 0.023355 *
## Humidity -39.063 3.597 -10.861 < 2e-16 ***
## Windspeed -8.733 1.014 -8.615 < 2e-16 ***
## Holiday1 -70.324 5.780 -12.167 < 2e-16 ***
## WeekdayWeekend -25.409 2.110 -12.043 < 2e-16 ***
## TimeofDayWorking Hour -1.436 6.794 -0.211 0.832550
## SeasonSpring 9.831 2.739 3.590 0.000331 ***
## SeasonSummer -48.257 3.127 -15.433 < 2e-16 ***
## SeasonWinter -27.934 3.220 -8.676 < 2e-16 ***
## HourOfDay1 -34.212 6.569 -5.208 1.92e-07 ***
## HourOfDay2 -48.474 6.573 -7.375 1.67e-13 ***
## HourOfDay3 -59.357 6.614 -8.975 < 2e-16 ***
## HourOfDay4 -64.500 6.639 -9.716 < 2e-16 ***
## HourOfDay5 -38.759 6.580 -5.891 3.87e-09 ***
## HourOfDay6 48.466 6.577 7.369 1.75e-13 ***
## HourOfDay7 249.983 6.577 38.006 < 2e-16 ***
## HourOfDay8 610.005 6.568 92.873 < 2e-16 ***
## HourOfDay9 538.675 7.086 76.022 < 2e-16 ***
## HourOfDay10 268.663 9.444 28.449 < 2e-16 ***
## HourOfDay11 250.066 9.460 26.433 < 2e-16 ***
## HourOfDay12 331.171 9.487 34.907 < 2e-16 ***
## HourOfDay13 343.751 9.518 36.115 < 2e-16 ***
## HourOfDay14 312.528 9.549 32.731 < 2e-16 ***
## HourOfDay15 317.677 9.568 33.201 < 2e-16 ***
## HourOfDay16 403.508 9.576 42.138 < 2e-16 ***
## HourOfDay17 661.644 9.565 69.170 < 2e-16 ***
## HourOfDay18 771.909 8.041 95.998 < 2e-16 ***
## HourOfDay19 514.892 6.642 77.520 < 2e-16 ***
## HourOfDay20 321.842 6.599 48.773 < 2e-16 ***
## HourOfDay21 199.693 6.572 30.387 < 2e-16 ***
## HourOfDay22 131.580 6.558 20.065 < 2e-16 ***
## HourOfDay23 57.800 6.552 8.822 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 202.2 on 45518 degrees of freedom
## Multiple R-squared: 0.6971, Adjusted R-squared: 0.6968
## F-statistic: 2757 on 38 and 45518 DF, p-value: < 2.2e-16
The R-squared value for the linear model with all the variables is 0.6971009.
We want to know how much each variable contributes to the linear model's R-squared value, i.e. the relative importance of each variable. For this we use the calc.relimp function from the relaimpo package.
| | LMG |
|---|---|
| Condition | 0.0256185 |
| Season | 0.0424971 |
| HourOfDay | 0.5769237 |
| Temp | 0.1340105 |
| Dew | 0.0643029 |
| Humidity | 0.0674104 |
| Windspeed | 0.0052728 |
| Holiday | 0.0017618 |
| Weekday | 0.0014998 |
| TimeofDay | 0.0807025 |
The relative importance table shows that hour of the day heavily influences the model: it accounts for more than fifty percent (about 58%) of the explained variance. Temperature, time of the day, humidity and dew point are the next most influential factors.
Having evaluated the linear model with all predictor variables, we now perform feature selection so that we can build a linear model on a subset of the variables without compromising accuracy.
The output for forward feature selection is as follows:
Adjusted R2 is maximum when variables such as Temp, TimeofDay and HourOfDay are included. Similarly, BIC and Cp are minimum when these same variables are included in the model. Hence, forward feature selection suggests that including Temp, TimeofDay and HourOfDay increases the accuracy of the prediction.
The output for backward feature selection is as follows:
Similarly, adjusted R2 is maximum when variables such as Temp, Humidity and HourOfDay are included, and BIC and Cp are minimum when these same variables are included in the model. Hence, backward feature selection suggests that including Temp, Humidity and HourOfDay increases the accuracy of the prediction.
Overall, from the results of forward and backward feature selection, we conclude that variables such as Temp, Humidity, TimeofDay and HourOfDay are important.
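To make the idea of forward selection concrete, here is a small greedy sketch in base R that adds one variable at a time as long as adjusted R2 improves. It is not the routine that produced the plots above (the report does not show that code); the data are simulated and every name in it (`toy`, `adj_r2`, `selected`) is hypothetical.

```r
set.seed(42)
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 3 * toy$x1 + 1.5 * toy$x2 + rnorm(n)  # x3 is pure noise

# Adjusted R2 of a linear model using the given predictor names.
adj_r2 <- function(vars) {
  rhs <- if (length(vars) == 0) "1" else paste(vars, collapse = " + ")
  summary(lm(as.formula(paste("y ~", rhs)), data = toy))$adj.r.squared
}

selected  <- character(0)
remaining <- c("x1", "x2", "x3")
best <- adj_r2(selected)
while (length(remaining) > 0) {
  scores <- sapply(remaining, function(v) adj_r2(c(selected, v)))
  if (max(scores) <= best) break  # no candidate improves the criterion
  pick      <- names(which.max(scores))
  selected  <- c(selected, pick)
  remaining <- setdiff(remaining, pick)
  best      <- max(scores)
}
print(selected)
```

Backward selection runs the same loop in reverse, starting from the full model and dropping the variable whose removal hurts the criterion least.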
First, we fit the model with the four selected variables (Temp, Dew, TimeofDay and HourOfDay), based on the feature selection done in section 5.2.1.
##
## Call:
## lm(formula = y ~ Temp + Dew + TimeofDay + HourOfDay, data = Training_Set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1023.34 -115.83 -7.61 102.98 1105.66
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 134.798 4.767 28.276 < 2e-16 ***
## Temp 256.484 2.537 101.086 < 2e-16 ***
## Dew -133.618 2.464 -54.225 < 2e-16 ***
## TimeofDayWorking Hour -2.563 6.965 -0.368 0.713
## HourOfDay1 -31.786 6.736 -4.719 2.38e-06 ***
## HourOfDay2 -46.761 6.738 -6.940 3.98e-12 ***
## HourOfDay3 -56.292 6.778 -8.306 < 2e-16 ***
## HourOfDay4 -60.978 6.801 -8.967 < 2e-16 ***
## HourOfDay5 -35.352 6.739 -5.246 1.56e-07 ***
## HourOfDay6 51.203 6.734 7.604 2.94e-14 ***
## HourOfDay7 251.431 6.735 37.334 < 2e-16 ***
## HourOfDay8 610.836 6.729 90.777 < 2e-16 ***
## HourOfDay9 540.248 7.262 74.392 < 2e-16 ***
## HourOfDay10 267.754 9.678 27.666 < 2e-16 ***
## HourOfDay11 245.276 9.688 25.317 < 2e-16 ***
## HourOfDay12 322.804 9.704 33.264 < 2e-16 ***
## HourOfDay13 330.735 9.725 34.007 < 2e-16 ***
## HourOfDay14 296.593 9.745 30.435 < 2e-16 ***
## HourOfDay15 299.180 9.759 30.658 < 2e-16 ***
## HourOfDay16 383.777 9.762 39.312 < 2e-16 ***
## HourOfDay17 642.766 9.756 65.882 < 2e-16 ***
## HourOfDay18 754.525 8.195 92.072 < 2e-16 ***
## HourOfDay19 500.691 6.772 73.936 < 2e-16 ***
## HourOfDay20 313.747 6.746 46.510 < 2e-16 ***
## HourOfDay21 194.693 6.730 28.929 < 2e-16 ***
## HourOfDay22 129.808 6.722 19.310 < 2e-16 ***
## HourOfDay23 57.118 6.718 8.502 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 207.3 on 45530 degrees of freedom
## Multiple R-squared: 0.6813, Adjusted R-squared: 0.6811
## F-statistic: 3744 on 26 and 45530 DF, p-value: < 2.2e-16
Then, we remove TimeofDay because its high p-value indicates that it is not significant, and rerun the model with the three remaining variables.
The summary of the linear model using Temp, Dew and HourOfDay as predictors is as follows:
##
## Call:
## lm(formula = y ~ Temp + Dew + HourOfDay, data = Training_Set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1023.35 -115.83 -7.61 103.03 1105.65
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 134.798 4.767 28.277 < 2e-16 ***
## Temp 256.487 2.537 101.089 < 2e-16 ***
## Dew -133.622 2.464 -54.228 < 2e-16 ***
## HourOfDay1 -31.785 6.736 -4.719 2.38e-06 ***
## HourOfDay2 -46.761 6.738 -6.940 3.98e-12 ***
## HourOfDay3 -56.292 6.777 -8.306 < 2e-16 ***
## HourOfDay4 -60.978 6.800 -8.967 < 2e-16 ***
## HourOfDay5 -35.352 6.739 -5.246 1.56e-07 ***
## HourOfDay6 51.203 6.734 7.604 2.93e-14 ***
## HourOfDay7 251.431 6.735 37.335 < 2e-16 ***
## HourOfDay8 610.836 6.729 90.778 < 2e-16 ***
## HourOfDay9 539.236 6.721 80.230 < 2e-16 ***
## HourOfDay10 265.191 6.719 39.467 < 2e-16 ***
## HourOfDay11 242.712 6.733 36.051 < 2e-16 ***
## HourOfDay12 320.240 6.755 47.409 < 2e-16 ***
## HourOfDay13 328.170 6.784 48.374 < 2e-16 ***
## HourOfDay14 294.028 6.811 43.167 < 2e-16 ***
## HourOfDay15 296.615 6.830 43.426 < 2e-16 ***
## HourOfDay16 381.211 6.836 55.767 < 2e-16 ***
## HourOfDay17 640.201 6.827 93.772 < 2e-16 ***
## HourOfDay18 752.841 6.799 110.723 < 2e-16 ***
## HourOfDay19 500.690 6.772 73.936 < 2e-16 ***
## HourOfDay20 313.746 6.746 46.511 < 2e-16 ***
## HourOfDay21 194.693 6.730 28.929 < 2e-16 ***
## HourOfDay22 129.808 6.722 19.310 < 2e-16 ***
## HourOfDay23 57.118 6.718 8.502 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 207.3 on 45531 degrees of freedom
## Multiple R-squared: 0.6813, Adjusted R-squared: 0.6811
## F-statistic: 3894 on 25 and 45531 DF, p-value: < 2.2e-16
Under the rules of each criterion, we look for the variables that give the largest R2 and adjusted R2, or the lowest Cp and BIC. In our case, all four criteria arrived at the same four variables: temperature, dew point, time of the day, and hour of the day. So we built a new model with these four variables.
Furthermore, we find that the model can be reduced to three variables by removing TimeofDay and keeping only Temp, Dew and HourOfDay.
As the results below show, the reduced model includes only 3 variables but still achieves an R2 close to that of the full model (0.681 versus 0.697). Therefore, we choose the reduced model over the full model and move on to the next phase of the optimization process.
| Type of Model | Model Formula | R-squared | Adjusted R-squared |
|---|---|---|---|
| Reduced Model | y ~ Temp + Dew + HourOfDay | 0.6813206 | 0.6811456 |
| Full Model | y ~ . | 0.6971009 | 0.696848 |
Temperature and dew point both have high VIF values, which indicates that the two are correlated and that multicollinearity is likely.
| | GVIF |
|---|---|
| Temp | 6.740818 |
| Dew | 6.420717 |
| HourOfDay | 1.352372 |
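For a numeric predictor, the variance inflation factor is 1 / (1 - R2), where R2 comes from regressing that predictor on all the others. The sketch below computes it by hand on simulated data to show why two correlated predictors inflate each other's value; the variable names and the helper `vif_one` are hypothetical, and the numbers are not the report's.

```r
set.seed(7)
n <- 500
temp <- rnorm(n)
dew  <- 0.9 * temp + sqrt(1 - 0.9^2) * rnorm(n)  # built to correlate with temp
wind <- rnorm(n)                                 # unrelated to the other two

# VIF_j = 1 / (1 - R2_j), with R2_j from regressing predictor j
# on all the other predictors.
vif_one <- function(target, others) {
  r2 <- summary(lm(target ~ ., data = others))$r.squared
  1 / (1 - r2)
}
vifs <- c(
  temp = vif_one(temp, data.frame(dew, wind)),
  dew  = vif_one(dew,  data.frame(temp, wind)),
  wind = vif_one(wind, data.frame(temp, dew))
)
print(round(vifs, 2))
```

In this toy setup the two correlated predictors get clearly larger values than the independent one, which is the same pattern as Temp and Dew versus HourOfDay in the table above.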
We therefore create two more linear models. The first includes Temp but excludes Dew; its summary is as follows:
##
## Call:
## lm(formula = y ~ Temp + HourOfDay, data = Training_Set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -969.35 -118.18 -6.35 106.83 1166.04
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 116.068 4.906 23.660 < 2e-16 ***
## Temp 130.069 1.033 125.872 < 2e-16 ***
## HourOfDay1 -37.007 6.950 -5.325 1.01e-07 ***
## HourOfDay2 -54.397 6.951 -7.826 5.14e-15 ***
## HourOfDay3 -67.623 6.990 -9.675 < 2e-16 ***
## HourOfDay4 -75.328 7.011 -10.744 < 2e-16 ***
## HourOfDay5 -52.041 6.946 -7.492 6.90e-14 ***
## HourOfDay6 33.173 6.940 4.780 1.76e-06 ***
## HourOfDay7 233.731 6.940 33.677 < 2e-16 ***
## HourOfDay8 597.804 6.938 86.160 < 2e-16 ***
## HourOfDay9 534.906 6.934 77.141 < 2e-16 ***
## HourOfDay10 273.178 6.931 39.413 < 2e-16 ***
## HourOfDay11 265.537 6.933 38.301 < 2e-16 ***
## HourOfDay12 356.177 6.936 51.353 < 2e-16 ***
## HourOfDay13 376.520 6.939 54.262 < 2e-16 ***
## HourOfDay14 351.069 6.944 50.560 < 2e-16 ***
## HourOfDay15 359.180 6.946 51.709 < 2e-16 ***
## HourOfDay16 446.037 6.944 64.231 < 2e-16 ***
## HourOfDay17 702.493 6.944 101.169 < 2e-16 ***
## HourOfDay18 807.372 6.938 116.366 < 2e-16 ***
## HourOfDay19 545.384 6.935 78.641 < 2e-16 ***
## HourOfDay20 346.049 6.933 49.914 < 2e-16 ***
## HourOfDay21 215.456 6.933 31.078 < 2e-16 ***
## HourOfDay22 142.440 6.932 20.549 < 2e-16 ***
## HourOfDay23 63.666 6.930 9.187 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 213.9 on 45532 degrees of freedom
## Multiple R-squared: 0.6607, Adjusted R-squared: 0.6606
## F-statistic: 3695 on 24 and 45532 DF, p-value: < 2.2e-16
The second linear model excludes Temp but includes Dew; its summary is as follows:
##
## Call:
## lm(formula = y ~ Dew + HourOfDay, data = Training_Set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1032.74 -114.45 -7.58 105.75 1212.24
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 97.224 5.259 18.488 < 2e-16 ***
## Dew 95.241 1.076 88.493 < 2e-16 ***
## HourOfDay1 -42.613 7.453 -5.718 1.09e-08 ***
## HourOfDay2 -62.680 7.454 -8.409 < 2e-16 ***
## HourOfDay3 -78.972 7.495 -10.536 < 2e-16 ***
## HourOfDay4 -89.695 7.518 -11.930 < 2e-16 ***
## HourOfDay5 -70.686 7.447 -9.492 < 2e-16 ***
## HourOfDay6 12.452 7.439 1.674 0.0942 .
## HourOfDay7 213.357 7.440 28.676 < 2e-16 ***
## HourOfDay8 582.496 7.439 78.300 < 2e-16 ***
## HourOfDay9 529.141 7.436 71.157 < 2e-16 ***
## HourOfDay10 280.384 7.433 37.720 < 2e-16 ***
## HourOfDay11 287.880 7.433 38.728 < 2e-16 ***
## HourOfDay12 391.723 7.433 52.698 < 2e-16 ***
## HourOfDay13 424.402 7.432 57.101 < 2e-16 ***
## HourOfDay14 407.816 7.433 54.862 < 2e-16 ***
## HourOfDay15 421.478 7.433 56.700 < 2e-16 ***
## HourOfDay16 510.403 7.431 68.689 < 2e-16 ***
## HourOfDay17 764.208 7.432 102.833 < 2e-16 ***
## HourOfDay18 861.166 7.430 115.910 < 2e-16 ***
## HourOfDay19 589.085 7.431 79.278 < 2e-16 ***
## HourOfDay20 377.687 7.431 50.823 < 2e-16 ***
## HourOfDay21 235.947 7.433 31.742 < 2e-16 ***
## HourOfDay22 154.825 7.433 20.829 < 2e-16 ***
## HourOfDay23 69.810 7.432 9.393 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 229.4 on 45532 degrees of freedom
## Multiple R-squared: 0.6098, Adjusted R-squared: 0.6096
## F-statistic: 2965 on 24 and 45532 DF, p-value: < 2.2e-16
The first linear model, the one with temperature, has the higher R2 value, so the final linear model consists of Temp and HourOfDay. With these 2 variables we can explain 66.07% of the variation in our training dataset.
We want to know the relative importance of each variable in our final linear model, which consists of only Temp and HourOfDay. Using the relaimpo package, we find that the hour of the day contributes most to predicting bike usage.
| | LMG |
|---|---|
| HourOfDay | 0.760272 |
| Temp | 0.239728 |
The plot of the relative importance is shown below:
After performing regression with a linear model, we now perform regression using decision trees. Decision trees have the advantage that they are simple to create and can work with non-linear data.
To save computational time, we stored the decision tree model as an RDS file and read that file back for our predictions.
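The caching pattern can be sketched with base R's `saveRDS()` and `readRDS()`. The file name and the small `lm` fit on the built-in `cars` data are stand-ins for the slow training step; they are not the report's actual model or path.

```r
# Fit once, cache to disk, and reuse the stored object on later runs.
model_path <- file.path(tempdir(), "toy_model.rds")  # hypothetical file name
if (!file.exists(model_path)) {
  fit <- lm(dist ~ speed, data = cars)  # stand-in for a slow training step
  saveRDS(fit, model_path)
}
cached_fit <- readRDS(model_path)
print(coef(cached_fit))
```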
We created the decision tree using the caret package. The “rpart2” method of caret’s train function tunes a decision tree over its maximum depth. With the argument tuneLength=9, caret evaluates nine candidate depths, the largest of which is 10. We also performed 10-fold cross-validation, repeated 3 times.
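To show what 10-fold cross-validation repeated 3 times means in terms of data, the sketch below builds the fold assignments by hand in base R and reports the size of each training subset. It is a simplified illustration: caret builds its folds with its own procedure, so exact sizes can differ slightly, and all names here are hypothetical.

```r
set.seed(123)
n <- 45557; k <- 10; repeats <- 3
train_sizes <- unlist(lapply(seq_len(repeats), function(r) {
  # Randomly assign every row to one of k nearly equal folds.
  fold_id <- sample(rep(seq_len(k), length.out = n))
  # Each fold is held out once; the other k - 1 folds are used for fitting.
  n - tabulate(fold_id, nbins = k)
}))
print(length(train_sizes))  # number of fits per candidate depth
print(range(train_sizes))
```

Each candidate depth is therefore fitted on roughly nine tenths of the training data many times over, and the reported RMSE, R-squared and MAE are averages over those held-out folds.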
The decision tree is as follows:
## CART
##
## 45557 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 41001, 41002, 41001, 41000, 41002, 41001, ...
## Resampling results across tuning parameters:
##
## maxdepth RMSE Rsquared MAE
## 1 337.1613 0.1568869 263.9271
## 2 319.9661 0.2407091 249.8055
## 4 292.9740 0.3633594 225.0676
## 5 281.2208 0.4134242 217.4486
## 6 269.7184 0.4604437 206.6960
## 7 260.2038 0.4978568 200.7196
## 8 251.3514 0.5314469 193.0507
## 9 241.6853 0.5667628 183.8102
## 10 206.6387 0.6833124 148.6568
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was maxdepth = 10.
As the summary above shows, the final depth used for the decision tree is 10. The plot of RMSE against the maximum depth of the decision tree is as follows:
As the depth of the tree increases, performance improves, i.e. the RMSE of the model decreases.
The decision tree that we will use for prediction is as follows:
The importance of the variables for the decision tree is shown below:
We notice that the variables Temp, WeekdayWeekend, Dew, SeasonWinter and HourOfDay18 have high variable importance. The importance of a variable depends on how high it appears in the decision tree and on the number of times it is used (Therneau, 2019, p. 11).
We trained our decision tree on the 2013 to 2018 data. We now make predictions on the 2019 data.
The evaluation metrics of the decision tree are as follows:
| R2 | RMSE | MAE |
|---|---|---|
| 0.7319768 | 194.6294 | 139.3643 |
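The three metrics can be computed from observed and predicted values as in the sketch below. The helper `eval_metrics` is hypothetical, the predictions are made up, and computing R2 as the squared correlation between observed and predicted values is an assumption about the convention behind the table.

```r
# Hypothetical helper returning the three reported metrics.
# R2 is taken as the squared correlation of observed and predicted values,
# which is an assumed convention rather than one stated in the report.
eval_metrics <- function(obs, pred) {
  c(R2   = cor(obs, pred)^2,
    RMSE = sqrt(mean((obs - pred)^2)),
    MAE  = mean(abs(obs - pred)))
}
obs  <- c(38, 41, 23, 5, 4, 21)   # first six Total_Bikes values from the processed data
pred <- c(30, 45, 20, 10, 8, 25)  # made-up predictions for illustration
metrics <- eval_metrics(obs, pred)
print(round(metrics, 3))
```

RMSE penalizes large misses more heavily than MAE, which is why the RMSE values in this report are consistently larger than the MAE values.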
Decision trees have a drawback: they are not flexible and tend to perform poorly on new samples of data. We therefore use bagging, which combines bootstrapping the data with aggregating the resulting models.
Bagged decision trees form an ensemble of decision trees. They overcome the drawback of a single tree by basing the estimate on a majority vote or an average across the ensemble.
To save computational time, we stored the bagged decision tree model as an RDS file and read that file back for our predictions.
We created the bagged decision tree using the caret package, with the “treebag” method of caret’s train function. Bagged decision trees have no tuning parameter. We performed 10-fold cross-validation, repeated 3 times.
The bagged decision tree is as follows:
## Bagged CART
##
## 45557 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 41002, 41001, 41001, 41001, 41002, 41002, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 199.3304 0.7060689 144.2645
The importance of the variables for the bagged decision tree is shown below:
We notice that the variables Temp, WeekdayWeekend, Dew, HourOfDay18 and SeasonWinter have high variable importance.
We trained our bagged decision tree on the 2013 to 2018 data. We now make predictions on the 2019 data.
The evaluation metrics of the bagged decision tree are as follows:
| R2 | RMSE | MAE |
|---|---|---|
| 0.7447782 | 190.1936 | 136.4659 |
Thus the bagged decision tree gives a slightly higher R-squared value than the single decision tree (0.745 versus 0.732).
Finally, we model using Random Forest.
In the bagged tree model, different bootstrap samples of the dataset are drawn and a tree is trained on each. Because all variables are available to every tree, the model may overfit, and the most significant variable in the dataset may also be masked by the others.
Let's try a limited number of variables and check the performance. We have 10 variables; do we need to try every option? Instead, let's use what we learned from the previous models, linear regression and decision tree: more than 80% of the variable importance comes from the top three variables. So let's start by hunting for a three-variable model.
Two variables stand out as important in the previous models, but the third is hard to choose. Let's try different combinations of three variables and check the model's performance.
We used the randomForest package to train the model. For regression, its mtry parameter (the number of variables randomly sampled as candidates at each split) defaults to p/3, rounded down, where p is the number of predictors; with our 10 predictors that is 3.
To save computational time, we trained the model in advance and loaded it from file to check its performance.
Let's use our model to predict on the test set.
The evaluation metrics of the random forest are as follows:
| R2 | RMSE | MAE |
|---|---|---|
| 0.9321908 | 97.85963 | 62.4085 |
This model gives the best R2 value so far. Let's check the importance of each variable to identify the top variables.
## %IncMSE IncNodePurity
## Condition 106.42494 163639162
## Temp 79.90901 862235383
## Dew 35.38582 295842195
## Humidity 62.20948 349574467
## Windspeed 50.81140 109738801
## Holiday 137.36333 53374646
## Weekday 450.50367 456264976
## TimeofDay 37.94861 417106486
## Season 41.75134 212546356
## HourOfDay 165.53598 3019676627
Mean Decrease Accuracy(%IncMSE) and Mean Decrease Gini(IncNodePurity) are calculated on the trained model.
Mean Decrease Accuracy(%IncMSE) - Refers to how much model accuracy decreases if we leave out that variable.
Mean Decrease Gini(IncNodePurity) - is the measure of variable importance based on the Gini impurity index used for calculating the splits in trees.
By %IncMSE, Weekday, HourOfDay, Holiday, Condition and Temp are the most important variables, which is similar to the relative importance seen in the other models.
Hang on! Let's try mtry values from 1 to 5 with 5-fold cross-validation, and repeat the whole process 3 times on different random splits so that the results are averaged.
We tune the random forest with a grid search over mtry in the range 1 to 5.
Training the model takes quite some time, so we trained it ahead of time, stored it, and read it directly from disk for faster execution and knitting of the file.
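The resampling scheme, 5 folds repeated 3 times, can be sketched as follows. The Python generator below is an illustration of plain repeated k-fold splitting and is not the report's R code; the function name `repeated_kfold` and the seed are assumptions. The row count 45557 is the training-set size reported in the model summary.

```python
import random


def repeated_kfold(n_rows, k, repeats, seed=0):
    """Yield (train_indices, held_out_indices) for each fold of each repeat."""
    rng = random.Random(seed)
    for _ in range(repeats):
        indices = list(range(n_rows))
        rng.shuffle(indices)            # a fresh random partition per repeat
        for fold in range(k):
            held_out = indices[fold::k]  # every k-th row forms one fold
            held_out_set = set(held_out)
            train = [i for i in indices if i not in held_out_set]
            yield train, held_out


# 5-fold cross-validation repeated 3 times on 45557 training rows.
sizes = [(len(train), len(held_out))
         for train, held_out in repeated_kfold(45557, k=5, repeats=3)]

print(len(sizes), "resamples")
print(sorted(set(sizes)))
```

Every candidate mtry value is fitted once per resample, so a grid of 5 values over 15 resamples means 75 forest fits, which explains the long training time. The training-set sizes produced here are close to, but need not exactly match, the sample sizes listed in the model summary, because the resampling tool may balance folds differently.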
mtry values from 1 to 5 were tried, and we can see that the R2 value increases as mtry increases.
## Random Forest
##
## 45557 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 36445, 36445, 36447, 36446, 36445, 36445, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 1 298.1751 0.6622144 234.8145
## 2 228.6833 0.7355588 175.3999
## 3 188.9903 0.7937174 140.6115
## 4 161.8658 0.8376181 116.6697
## 5 142.4377 0.8669464 99.4204
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 5.
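The selection rule stated in the last two lines of the output is easy to reproduce. The Python sketch below copies the cross-validated results from the table above and picks the mtry with the smallest RMSE; the dictionary name `cv_results` is an assumption made for this illustration.

```python
# Cross-validated results copied from the resampling table above.
cv_results = {
    1: {"RMSE": 298.1751, "Rsquared": 0.6622144, "MAE": 234.8145},
    2: {"RMSE": 228.6833, "Rsquared": 0.7355588, "MAE": 175.3999},
    3: {"RMSE": 188.9903, "Rsquared": 0.7937174, "MAE": 140.6115},
    4: {"RMSE": 161.8658, "Rsquared": 0.8376181, "MAE": 116.6697},
    5: {"RMSE": 142.4377, "Rsquared": 0.8669464, "MAE": 99.4204},
}

# Select the mtry value with the smallest cross-validated RMSE.
best_mtry = min(cv_results, key=lambda m: cv_results[m]["RMSE"])

# How much RMSE drops with each extra candidate variable per split.
rmse_drops = [round(cv_results[m]["RMSE"] - cv_results[m + 1]["RMSE"], 4)
              for m in range(1, 5)]

print("best mtry:", best_mtry)
print("RMSE drop per step:", rmse_drops)
```

The drops get smaller at each step but have not levelled off by mtry = 5, which suggests that the best value may lie beyond the upper end of the grid.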
The plot shows that the cross-validated error (RMSE) decreases as the number of randomly selected predictors (mtry) increases, which is not surprising.
Let's use this model, which was tuned to get accurate predictions with fewer variables, to predict the test data.
| R2 | RMSE | MAE |
|---|---|---|
| 0.8875139 | 133.6627 | 92.47352 |
With a maximum of 5 variables, we are able to achieve an R2 of about 0.88 on the test data. Let's check which variables contribute most to this R2 value.
As most of the variables are categorical, each level is treated as a separate variable, and we can see that HourOfDay, Weekday, Condition and Temp are the variables that contribute most to our model.
HourOfDay = 18 (6 PM) is the best predictor. Whether or not it is a weekend, and the time of day combined with working day, are also good indicators of the total number of bikes used in a given hour.
| Model | No. of Predictors | R2 (Test Data) | Variable Importance (descending) |
|---|---|---|---|
| Linear Regression | Full Model i.e. all variables | 0.69 | HourOfDay, Temp, TimeOfDay |
| Linear Regression | 2 | 0.67 | HourOfDay,Temp |
| Decision Tree | Full Model i.e. all variables | 0.73 | Temp, Weekday, Dew, Season, HourOfDay |
| Bagged Tree | Full Model i.e. all variables | 0.74 | Temp, Weekday, Dew, HourOfDay, Season |
| Random Forest | 3 | 0.93 | HourOfDay, Weekday, Holiday, Condition, Temp |
| Random Forest- Grid Search | Combination of 1 to 5 variables | 0.88 | HourOfDay, Weekday, Holiday, Temp, Holiday |
Thus the random forest performs best, with the highest R2 value of 0.93.
In terms of modelling and predictions, we can conclude that:
- Random Forest works best with the given dataset.
- The maximum R2 value obtained is 0.93.
- Variable importance is as follows:
  - Hour of Day: best at 6 PM to 7 PM and 8 AM to 9 AM
  - Weekday: weekend
  - Time of Day: working hours
  - Temp: moderate temperature, 70 to 90 °F
The insight from our analysis is that on a normal working day, users tend to ride bikes to commute to offices, schools, etc., whereas on weekends and holidays people prefer to use bikes for travel and leisure. We also find that bikes are used most in moderate temperatures and that users tend to avoid them at very high and very low temperatures.
Based on our analysis, we recommend that Capital Bikeshare increase bike availability during the high-demand morning and evening office hours and on weekends and holidays, thereby serving more users and, in turn, securing more profit.
Motivate International, Inc. (n.d.). Press Kit. Retrieved November 26, 2019, from https://www.capitalbikeshare.com/press-kit.
Capital Bikeshare Discount. (n.d.). Retrieved November 26, 2019, from https://benefits.gwu.edu/capital-bikeshare-discount.
Therneau, T. M. (2019, April 11). An Introduction to Recursive Partitioning Using the RPART Routines. Retrieved November 26, 2019, from https://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf.